Abstract

In this technical report, we address the problem of recovering 3-D models from sequences of 
uncalibrated images with unknown correspondence. To that end, we integrate tracking, structure 
from motion with geometric constraints, and use of deformable 3-D models in a single framework. 
The key to making the proposed approach work is the use of appearance-based model matching 
and refinement. 

This appearance-based constrained structure from motion (AbCSfm) approach is especially 
useful in recovering shapes of objects whose general structure is known but which may have little
discernible texture in significant parts of their surfaces. We applied the proposed approach to
3-D face modeling from multiple images to create new 3-D faces for DECface, a synthetic talking
head developed at Cambridge Research Laboratory, Digital Equipment Corporation. The DECface 
model comprises a collection of 3-D triangular and rectangular facets, with nodes as vertices. In 
recovering the DECface model, we assume that the sequence of images is taken with a camera 
with unknown camera focal length and extrinsic parameters (i.e., camera pose). Results of this 
approach show its good convergence properties and its robustness against cluttered backgrounds.

1 Introduction

The classical approach to recovering 3-D structure from a sequence of images is to calibrate the 
camera, track the features across the sequence, and then apply stereo techniques using the tracked 
features. More recent techniques allow 3-D structures to be recovered without explicit camera 
calibration. Nevertheless, the processes of feature tracking and structure from motion are almost 
always separate. 

In this technical report, we propose an approach that integrates tracking, structure from
motion with geometric constraints, and use of deformable 3-D models in a single framework. The
input image sequence is assumed uncalibrated, and the image correspondences are also assumed
not known. The key to making the proposed approach work is the use of appearance-based model
matching and refinement. Another distinguishing feature of this approach is that feature
correspondences are not statically determined; they may "drift" over time according to how well they
satisfy both local image similarity and 3-D geometric constraints.

This appearance-based constrained structure from motion (AbCSfm) approach is especially
useful in recovering shapes of objects whose general structure is known but which may have little
discernible texture in significant parts of their surfaces. A good example of such an object is the
human face, where there is usually a significant amount of relatively untextured regions
(especially if there is little facial hair) and where the facial structure is known. We applied the proposed
approach to 3-D face modeling from multiple images to create new 3-D faces for DECface, a
synthetic talking head developed at Cambridge Research Laboratory, Digital Equipment Corporation.
The DECface model comprises a collection of 3-D triangular and rectangular facets, with nodes as
vertices. In recovering the DECface model, we assume that the sequence of images is taken with
a camera with unknown camera focal length and extrinsic parameters (i.e., camera pose).

In our current implementation, we use the frontal shot of the face as the reference image and
impose a line-of-sight constraint on the 3-D facial nodes using this reference image. We also constrain
3-D model deformation by minimizing an objective function that trades off minimal change in local
curvature and node position against fit to predicted point correspondences and face appearance.

1.1 Prior work 

There is a large body of work on the recovery of raw 3-D data from multiple images; this includes
multibaseline stereo [14], trinocular stereo that combines the constant brightness constraint with the
trilinear tensor (small displacements, only three images) [19], stereo with interpolation [4], and shape
from rotation [21, 30]. In a work that unifies image matching with stereo, Xu and Zhang [29] use
initially extracted correspondences to estimate the epipolar geometry using a robust estimator. The
computed epipolar geometry is then used to recover more correspondences as in classical stereo
matching.

Virtually all stereo approaches assume fixed disparity throughout once it has been established, 
e.g., through a separate feature tracker or image registration technique. Most techniques assume 
that the camera parameters, intrinsic and extrinsic, are known. Our proposed method integrates the 
tracker with structure and motion recovery, and does not assume that the focal length is known. In 
theory, for general camera motion with constant intrinsic parameters, three views are sufficient to 
recover structure, camera motion, and all five camera intrinsic parameters [7, 20]. For algorithmic 
stability, we assume only one unknown intrinsic camera parameter, namely the focal length. The 
aspect ratio is assumed to be unity, the image skew to be insignificant, and the principal point to 
be coincident with the center of the image. 

The approaches specific to face modeling can be partitioned into two categories based on the 
input, namely range and image data, and images only. In an approach that uses both range and 
image data, Lee et al. [11] use dense 3-D data from a Cyberware Color Digitizer™, and apply
3-D feature-based matching (for facial features such as the nose, chin, ears, eyes) to initialize their
3-D adaptable facial mesh. This facial mesh is subsequently augmented with a dynamic model
of facial tissue controlled by facial muscles. Kang et al. [10] use as input both a range image and
the corresponding color image of the face. They use color-based 2-D facial feature detection methods
to locate the eyes, eyebrows, and mouth. The feature detection involves computing edges in color
space followed by contour extraction and smoothing by dilation and shrinking.

The simplest case of techniques using only images as input involves only two orthogonal
views (namely, the front and side views) of the face. Extraction of the 3-D face model would then
entail profile analysis, identification of facial features from contours, and adjustment of a 3-D face
template through interpolation [1, 8].

Lengagne et al. use a calibrated stereo pair and use the dense disparity map computed through 
an interpolation technique [4]. In their approach, the 3-D deformation of the face model is guided 
by differential features that have high curvature values (such as the nose and eye orbits). 

Two representative works that use as input a sequence of face images to refine a 3-D face model
are those of DeCarlo and Metaxas [2] and Jebara and Pentland [9]. The first method uses optical
flow in an image sequence to move and deform the face model [2] for expression tracking. Facial
anthropometric data is used to limit facial model deformations in the initialization and during
tracking. The focal length of the camera is assumed to be known approximately. In the second
method, the eyes, nose and mouth are tracked, and the structure and motion of the face are estimated
using recursive Kalman filtering [9]. The deformation of the face shape is constrained to a linear
subspace of eigenvectors obtained from the Singular Value Decomposition (SVD) of sample face shapes.
In this case, the whole face is not tracked. In a more general approach, Fua and Leclerc [5]
reconstruct both shape and reflectance properties of surfaces from multiple images. The surface
shape is initialized by conventional stereo, and is deformed while minimizing an objective function
that is a weighted sum of stereo, shading, and smoothness constraints.

1.2 Organization 

In section 2, we describe in detail our approach which we call appearance-based constrained 
structure from motion (AbCSfm). This approach enables 3-D models to be extracted from multiple 
images despite initially unknown feature correspondences. It is based on image-based registration 
that is guided by predicted 3-D image appearance and a structure from motion algorithm. To 
illustrate the proposed approach, we then describe an application that uses AbCSfm to recover 3-D 
facial models from multiple images in section 3.1. Discussion of the method and a possible variant 
of it is given in section 4, with a summary subsequently provided. 

2 General approach 

We have developed an approach that allows us to recover a 3-D model from initially unknown 
point correspondence and an approximate 3-D template. We call this approach appearance-based 
constrained structure from motion (AbCSfm). The components of AbCSfm, as shown in Figure 1, 
are 

• Image registration (spline-based registration in our case [22])

• Structure from motion (iterative Levenberg-Marquardt batch approach [23])

• Appearance prediction (simple texture resampling [28]). The predicted appearance is
computed based on current image point correspondences and structure from motion estimates,
and is used to refine image registration.

In this approach, initialization is first done by performing pair-wise spline-based registration
between one frame, used as a reference, and every other frame. This establishes a set of gross point
correspondences across the image sequence, from which the camera parameters and model shape
are extracted. Subsequently, the approach iterates over three major steps:

1. Appearance prediction

In this step, for each image other than the reference image, the appearance of the 3-D model
given the camera pose and intrinsic parameters is computed and projected onto a new image.

2. Spline-based image registration

The predicted image is registered with the actual image to refine the point correspondences.

3. Structure from motion

Using the refined point correspondences, compute new (usually better) estimates of the
camera pose and intrinsic parameters, as well as the 3-D model shape.
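
The control flow of the initialization and the three iterated steps can be sketched as follows. This is only a minimal skeleton under stated assumptions: the callables `register`, `solve_sfm`, and `predict` are hypothetical stand-ins for the spline-based registration, the structure from motion solver, and the texture-mapped renderer, and the fixed iteration count `n_iters` is an assumption, since no stopping criterion is given in the text.

```python
def abcsfm(images, ref_index, register, solve_sfm, predict, n_iters=3):
    """Alternate appearance prediction, registration, and structure from motion.

    register(source, target) -> point correspondences (hypothetical callable)
    solve_sfm(tracks)        -> (model, cameras), cameras indexed by frame
    predict(model, camera)   -> predicted image of the model for that camera
    """
    others = [k for k in range(len(images)) if k != ref_index]
    # Initialization: pair-wise registration of the reference frame
    # with every other frame gives gross point correspondences.
    tracks = {k: register(images[ref_index], images[k]) for k in others}
    model, cameras = solve_sfm(tracks)
    for _ in range(n_iters):
        for k in others:
            predicted = predict(model, cameras[k])      # 1. appearance prediction
            tracks[k] = register(predicted, images[k])  # 2. registration with the actual image
        model, cameras = solve_sfm(tracks)              # 3. structure from motion
    return model, cameras, tracks
```

Keeping the three components behind plain callables mirrors the loose coupling between tracking and structure from motion that the approach relies on.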

The use of an appearance-based strategy is important as it accounts for not only occlusions, but
also perspective distortion due to changes in object pose. In contrast to Lowe's approach [13],
which uses edges, we use whole predicted images.

2.1 Tracking by spline-based registration 

In the spline-based registration framework [22, 24], a new image $I_2$ is registered to an initial base
image $I_1$ using a sum of squared differences formula

$$E(\{u_i, v_i\}) = \sum_i \left[ I_2(x_i + u_i, y_i + v_i) - I_1(x_i, y_i) \right]^2, \tag{1}$$

where the $\{u_i, v_i\}$'s are the per-pixel flow estimates.

In this registration technique, the flow estimates $\{u_i, v_i\}$ are represented using two-dimensional
splines controlled by a smaller number of displacement estimates $\hat{u}_j$ and $\hat{v}_j$ which lie on a coarser
spline control grid (Figure 2). This is in contrast to representing them as completely independent
quantities (and thus having an underconstrained problem). The value for the displacement at a
pixel $i$ can be written as

$$\begin{pmatrix} u(x_i, y_i) \\ v(x_i, y_i) \end{pmatrix} = \sum_j \begin{pmatrix} \hat{u}_j \\ \hat{v}_j \end{pmatrix} B_j(x_i, y_i) \quad \text{or} \quad \begin{pmatrix} u_i \\ v_i \end{pmatrix} = \sum_j \begin{pmatrix} \hat{u}_j \\ \hat{v}_j \end{pmatrix} w_{ij}, \tag{2}$$

where the $B_j(x, y)$ are called the basis functions and are only non-zero over a small interval (finite
support). The $w_{ij} = B_j(x_i, y_i)$ are called weights to emphasize that the $(u_i, v_i)$ are known linear
combinations of the $(\hat{u}_j, \hat{v}_j)$.

Figure 1: General approach of appearance-based constrained structure from motion (AbCSfm).

Figure 2: Displacement spline: the spline control vertices $\{(\hat{u}_j, \hat{v}_j)\}$ are shown as circles (o) and
the pixel displacements $\{(u_i, v_i)\}$ are shown as pluses (+) [22].

In the current implementation, the spline control grid is a regular subsampling of the pixel grid,
$\hat{x}_j = m x_i$, $\hat{y}_j = m y_i$, so that each set of $m \times m$ pixels corresponds to a single spline patch. We use
bilinear basis functions, i.e., $B_j(x, y) = \max((1 - |x - \hat{x}_j|/m)(1 - |y - \hat{y}_j|/m), 0)$ (see [22] for
a discussion of other possible bases). The local spline-based flow parameters are recovered using
a variant of the Levenberg-Marquardt iterative non-linear minimization technique [17].
We also modified (2) to include the weights $m_{ij}$ associated with a mask as follows:

$$\begin{pmatrix} u_i \\ v_i \end{pmatrix} = \sum_j \begin{pmatrix} \hat{u}_j \\ \hat{v}_j \end{pmatrix} m_{ij} w_{ij}, \tag{3}$$

where $m_{ij} = 1$ or $0$ if the corresponding pixel is in the object or background area respectively.
This is necessary to prevent registration of the background areas from influencing registration of the
projected model areas across images. $m_{ij}$ can also assume values between 0 and 1, especially
during the hierarchical search where the images are subsampled and the intensities averaged.
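
As an illustration of the spline interpolation in (2), the following sketch expands control-vertex displacements on a regular grid of spacing $m$ into a dense per-pixel flow field using the bilinear basis. It is a minimal sketch, not the implementation of [22]: the function names are hypothetical, the control vertices are assumed to sit at pixel coordinates $0, m, 2m, \ldots$, and the basis is clamped per axis, which agrees with the bilinear basis inside its support and makes the finite support explicit.

```python
import numpy as np

def bilinear_weights(coords, m, n_ctrl):
    """1-D weights w[i, j] = max(1 - |coords[i] - m*j| / m, 0)."""
    ctrl = m * np.arange(n_ctrl)  # assumed control vertex positions 0, m, 2m, ...
    return np.maximum(1.0 - np.abs(coords[:, None] - ctrl[None, :]) / m, 0.0)

def dense_flow(u_hat, v_hat, m, height, width):
    """Interpolate control displacements (rows = y, columns = x) to every pixel."""
    wy = bilinear_weights(np.arange(height), m, u_hat.shape[0])
    wx = bilinear_weights(np.arange(width), m, u_hat.shape[1])
    # The 2-D bilinear basis is separable, so the sum over control vertices
    # in (2) reduces to two small matrix products.
    u = wy @ u_hat @ wx.T
    v = wy @ v_hat @ wx.T
    return u, v
```

For a 9 x 9 image with $m = 4$, a 3 x 3 control grid covers all pixels, and each pixel is influenced by at most four control vertices.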



2.2 General structure from motion 

The formulation of recovering structure from motion is based on that of [23]. Essentially, we are
trying to recover a set of 3-D structure parameters $\mathbf{p}_i$ and time-varying motion parameters $T_j$ from
a set of observed image features $\mathbf{u}_{ij}$. The general equation linking a 2D image feature location $\mathbf{u}_{ij}$
in frame $j$ to its 3-D position $\mathbf{p}_i$ ($i$ is the track index) is

$$\mathbf{u}_{ij} = \mathcal{P}\left( T_j^{(K)} \cdots T_j^{(1)} \mathbf{p}_i \right), \tag{4}$$

where the perspective projection transformation $\mathcal{P}()$ is applied to a cascaded series of rigid
transformations $T_j^{(k)}$. Each transformation is in turn defined by

$$T_j^{(k)} \mathbf{x} = R_j^{(k)} \mathbf{x} + \mathbf{t}_j^{(k)}, \tag{5}$$

where $R_j^{(k)}$ is a rotation matrix and $\mathbf{t}_j^{(k)}$ is a translation applied after the rotation. Within each of
the cascaded transforms, the motion parameters may be time-varying (the $j$ subscript is present) or
fixed (the subscript is dropped).

The general camera-centered perspective projection equation is

$$\begin{pmatrix} u \\ v \end{pmatrix} = \mathcal{P} \begin{pmatrix} x \\ y \\ z \end{pmatrix} \equiv \begin{pmatrix} \dfrac{f x + \sigma y}{z} + u_0 \\ \dfrac{r f y}{z} + v_0 \end{pmatrix}, \tag{6}$$

where $f$ is a product of the focal length of the camera and the pixel array scale factor, $r$ is the image
aspect ratio, $\sigma$ is the image skew, and $(u_0, v_0)$ is the principal point. In theory, for general camera
motion with constant intrinsic parameters, three views are sufficient to recover structure, camera
motion, and all five camera intrinsic parameters [7, 20]. For stability, we assume that only one intrinsic
camera parameter matters, namely the focal length (the aspect ratio is assumed to be unity).
An alternative object-centered formulation (a more general version of [23]) which we use is

$$\begin{pmatrix} u \\ v \end{pmatrix} = \mathcal{P} \begin{pmatrix} x \\ y \\ z \end{pmatrix} \equiv \begin{pmatrix} \dfrac{s x + \eta \sigma y}{1 + \eta z} + u_0 \\ \dfrac{r s y}{1 + \eta z} + v_0 \end{pmatrix} = \begin{pmatrix} \dfrac{s x}{1 + \eta z} \\ \dfrac{r s y}{1 + \eta z} \end{pmatrix}, \tag{7}$$

with the reasonable assumption that $\sigma = 0$ and $(u_0, v_0) = (0, 0)$. Here, we assume that the
$(x, y, z)$ coordinates before projection are with respect to a reference frame that has been displaced
away from the camera by a distance $t_z$ along the optical axis,$^1$ with $s = f/t_z$ and $\eta = 1/t_z$. The
projection parameter $s$ can be interpreted as a scale factor and $\eta$ as a perspective distortion factor.
Our alternative perspective formulation results in a more robust recovery of camera parameters
under weak perspective, where $\eta \ll 1$, and assuming $(u_0, v_0) \approx (0, 0)$ and $\sigma \approx 0$, we have
$\mathcal{P}(x, y, z)^T \approx (sx, rsy)^T$. This is because $s$ and $rs$ can be much more reliably recovered than $\eta$,
in comparison with the old formulation where $f$ and $t_z$ are very highly correlated.
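
The relationship between the two formulations can be checked numerically. The sketch below assumes zero skew and a principal point at the origin, as in the text; the function names are hypothetical. It shows that the object-centered projection with $s = f/t_z$ and $\eta = 1/t_z$ reproduces the camera-centered projection of a point displaced by $t_z$ along the optical axis, and that it degrades gracefully to a scaled orthographic projection when $\eta = 0$.

```python
import numpy as np

def project_camera_centered(p, f, tz, r=1.0):
    """Perspective projection of p after displacing it by tz along the optical axis."""
    x, y, z = p
    return np.array([f * x / (z + tz), r * f * y / (z + tz)])

def project_object_centered(p, s, eta, r=1.0):
    """Object-centered form with scale s = f / tz and distortion eta = 1 / tz."""
    x, y, z = p
    return np.array([s * x / (1.0 + eta * z), r * s * y / (1.0 + eta * z)])
```

For example, with `f = 500` and `tz = 4`, calling `project_object_centered(p, s=125.0, eta=0.25)` gives the same image point as `project_camera_centered(p, 500.0, 4.0)`, while `eta = 0` yields `(s * x, r * s * y)` without any division by depth.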



$^1$ If we wish, we can view $t_z$ as the $z$ component of the original global translation which is absorbed into the
projection equation, and then set the third component of $\mathbf{t}$ to zero.

2.3 Least-squares minimization with geometric constraints

The Levenberg-Marquardt algorithm [17] is used to solve for the structure and motion parameters.
Without the geometric constraints, the formulation is exactly that of [23]. We are, instead, trying to
minimize

$$\mathcal{E}_{all}(\mathbf{a}) = \mathcal{E}_{sfm}(\mathbf{a}) + \mathcal{E}_{geom}(\mathbf{a}), \tag{8}$$

where

$$\mathcal{E}_{sfm}(\mathbf{a}) = \sum_i \sum_j c_{ij} \left| \mathbf{u}_{ij} - \mathcal{P}(\mathbf{a}_{ij}) \right|^2 \tag{9}$$

is the usual structure from motion objective function that minimizes deviation from observed point
feature positions. $\mathcal{P}()$ is given in (4), and

$$\mathbf{a}_{ij} = \left( \mathbf{p}_i^T, \mathbf{m}_j^T, \mathbf{m}_g^T \right)^T \tag{10}$$

is the vector of structure and motion parameters which determine the image of point $i$ in frame
$j$. The vector $\mathbf{a}$ contains all of the unknown structure and motion parameters, including the 3-D
points $\mathbf{p}_i$, the time-dependent motion parameters $\mathbf{m}_j$, and the global motion/calibration parameters
$\mathbf{m}_g$. The weight $c_{ij}$ in (9) describes our confidence in measurement $\mathbf{u}_{ij}$, and is normally set to the
inverse variance $\sigma_{ij}^{-2}$. Implementational details are given in [23]. In our case, we set $c_{ij}$ to be a
value proportional to the least amount of local texture indicated by the minimum eigenvalue of the
local Hessian. The local Hessian $H$ is given by

$$H = \sum_{W} \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix}, \tag{11}$$

$W$ being the local window centered at $(x, y)$ and $(I_x, I_y)$ the intensity gradient at $(x, y)$. If $e_{min,ij}$
is the minimum eigenvalue at point $i$ in frame $j$, then

$$c_{ij} = e_{min,ij}. \tag{12}$$

This is particularly important in the case of face model recovery because of the possible lack
of texture on parts of the face, such as the cheeks and forehead areas. Using this metric for $c_{ij}$
downplays the importance of points on these relatively untextured areas (see, for example, [18,
24]). To account for occlusions, $c_{ij}$ is set to zero if the corresponding point is predicted to be
hidden.
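
The confidence weight can be illustrated with a short sketch. It assumes a square window of half-width `half` (the window size is not stated in the text), uses finite differences from `numpy.gradient` for the intensity gradient, and takes a hypothetical `visible` flag for the occlusion test; the function name is likewise hypothetical.

```python
import numpy as np

def min_eig_confidence(image, x, y, half=2, visible=True):
    """Confidence for a point: minimum eigenvalue of the local Hessian in (11)."""
    if not visible:
        return 0.0  # points predicted to be hidden get zero weight
    iy, ix = np.gradient(image.astype(float))  # derivatives along rows (y) and columns (x)
    win = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    gx, gy = ix[win].ravel(), iy[win].ravel()
    hessian = np.array([[gx @ gx, gx @ gy],
                        [gx @ gy, gy @ gy]])
    # eigvalsh returns eigenvalues in ascending order for a symmetric matrix.
    return max(float(np.linalg.eigvalsh(hessian)[0]), 0.0)
```

A uniform patch and a patch with a gradient in only one direction both give a weight of zero, which is the intended behavior for untextured regions such as the cheeks and forehead.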

The other term in (8) is

$$\mathcal{E}_{geom}(\mathbf{a}) = \sum_i \left( \alpha_i \left| h_i - h_i^0 \right|^2 + \beta_i \left| \mathbf{p}_i - \mathbf{p}_i^0 \right|^2 \right), \tag{13}$$

which is the additional geometric constraint that reduces the deformation of the template or
reference 3-D model. The quantities with the superscript 0 refer to the reference 3-D model that is to
be deformed. $h_i$ is the perpendicular distance of point $\mathbf{p}_i$ to the plane passing through its nearest
neighbors (three in our case). In other words, if $\Pi_i$ is the best fit plane of the neighbor points of $\mathbf{p}_i$,
and $\mathbf{p} \cdot \hat{\mathbf{n}}_i = d_i$ is the equation of $\Pi_i$, then

$$h_i = \mathbf{p}_i \cdot \hat{\mathbf{n}}_i - d_i. \tag{14}$$

$\alpha_i$ is the weight associated with the preservation of local height (in a sense, preserving curvature),
and $\beta_i$ is the weight associated with the preservation of the reference 3-D position. The weights
can be made to vary from node to node, or made constant across all nodes, as in our case.
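
A small sketch of the geometric term in (13) and the height in (14) follows. It assumes exactly three neighbors per node, so the plane passes through them exactly, and constant weights across nodes; the function names and the `neighbor_idx` mapping are hypothetical. The sign of the plane normal depends on the neighbor ordering, so the same ordering is used for the current and the reference shape.

```python
import numpy as np

def plane_height(p, neighbors):
    """Signed distance h of point p to the plane through three neighbor points."""
    a, b, c = neighbors
    n = np.cross(b - a, c - a)
    n = n / np.linalg.norm(n)  # unit normal of the neighbor plane
    d = n @ a                  # plane equation: p . n = d
    return p @ n - d

def geom_energy(points, ref_points, neighbor_idx, alpha=0.25, beta=0.25):
    """Sum of height and position deviations from the reference model."""
    total = 0.0
    for i, nbrs in neighbor_idx.items():
        h = plane_height(points[i], points[nbrs])
        h0 = plane_height(ref_points[i], ref_points[nbrs])
        total += alpha * (h - h0) ** 2 + beta * np.sum((points[i] - ref_points[i]) ** 2)
    return total
```

Moving a single node by 0.5 along its neighbor-plane normal with both weights at 0.25 contributes 0.25 x 0.25 from the height term and 0.25 x 0.25 from the position term.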
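The Levenberg-Marquardt algorithm first forms the approximate Hessian matrix

$$A = \sum_i \sum_j c_{ij} \frac{\partial \mathcal{P}^T(\mathbf{a}_{ij})}{\partial \mathbf{a}} \frac{\partial \mathcal{P}(\mathbf{a}_{ij})}{\partial \mathbf{a}^T} + \sum_i B(i), \tag{15}$$

where $B(i)$ is a matrix which is zero everywhere except at the diagonal entries corresponding to
the $i$th 3-D point. The weighted gradient vector is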



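$$\mathbf{b} = -\sum_i \sum_j c_{ij} \frac{\partial \mathcal{P}^T(\mathbf{a}_{ij})}{\partial \mathbf{a}} \mathbf{e}_{ij} + \sum_i \mathbf{g}_i, \tag{16}$$

where $\mathbf{g}_i = (0 \ldots \mathbf{p}'^T_i \ldots 0)^T$, and

$$\mathbf{p}'_i = \alpha_i (h_i - h_i^0) \frac{\partial h_i}{\partial \mathbf{p}_i} + \beta_i (\mathbf{p}_i - \mathbf{p}_i^0) = \alpha_i (h_i - h_i^0) \hat{\mathbf{n}}_i + \beta_i (\mathbf{p}_i - \mathbf{p}_i^0), \tag{17}$$

from (14) and using the simplifying assumption that each node position is independent of its
neighbors (not strictly true). $\mathbf{e}_{ij} = \mathbf{u}_{ij} - \mathcal{P}(\mathbf{a}_{ij})$ is the image plane error of point $i$ in frame $j$.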

Given a current estimate of $\mathbf{a}$, it computes an increment $\delta\mathbf{a}$ towards the local minimum by
solving

$$(A + \lambda I)\, \delta\mathbf{a} = -\mathbf{b}, \tag{18}$$

where $\lambda$ is a stabilizing factor which varies over time [17].
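
A single damped update of this kind can be sketched as follows. This is a generic Levenberg-Marquardt step under stated assumptions, not the implementation of [23]: the Jacobian `J` of the predicted image positions is assumed to be given as a dense matrix with one row per scalar observation, `g` stands for the gradient of the geometric term, `B` for its optional Hessian contribution, and the schedule for updating `lam` is left out.

```python
import numpy as np

def lm_step(J, residuals, weights, g, lam, B=None):
    """Solve (A + lam * I) delta = -b for one Levenberg-Marquardt increment.

    J         : (n_obs, n_params) Jacobian of the predicted measurements
    residuals : (n_obs,) observed minus predicted measurements
    weights   : (n_obs,) confidence of each observation
    g         : (n_params,) gradient of the geometric term
    B         : optional (n_params, n_params) geometric contribution to A
    """
    A = J.T @ (weights[:, None] * J)       # weighted approximate Hessian
    if B is not None:
        A = A + B
    b = -J.T @ (weights * residuals) + g   # weighted gradient vector
    return np.linalg.solve(A + lam * np.eye(A.shape[0]), -b)
```

With `lam = 0` and no geometric term this reduces to a Gauss-Newton step, which solves a linear problem in one iteration; larger `lam` shortens the step and moves it towards gradient descent.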

We also impose the line-of-sight constraint on the recovered 3-D point with respect to the 
reference image.

2.4 Generating predicted appearance

It is relatively easy to render the model given the 3-D surface model (with its facets and vertices) 
and its position and orientation. The object facets are sorted in order of decreasing depth relative 
to the camera, and then rendered by texture-mapping the facets in the same order. The rendering 
technique used in our work is a standard technique in computer graphics, and can be found in [28]. 
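
The depth ordering described above can be sketched in a few lines. The sketch assumes that vertices are already expressed in the camera frame with $z$ as depth, and that the depth of a facet is taken as the mean depth of its vertices; the latter is an assumption, since the text does not say how facet depth is measured. The function name is hypothetical, and the texture mapping itself is omitted.

```python
import numpy as np

def painter_order(vertices_cam, facets):
    """Return facet indices sorted from farthest to nearest (painter's algorithm)."""
    # Facets may be triangles or quadrilaterals, so each one is averaged separately.
    depth = np.array([vertices_cam[list(f), 2].mean() for f in facets])
    return np.argsort(-depth, kind="stable")
```

Rendering the facets in the returned order lets nearer facets overwrite farther ones, which handles self-occlusion for a mostly convex surface such as a face.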

A 3-D model that is a good candidate for our proposed approach is the human face model.
Its structure is known, and conventional stereo techniques are not very reliable because the
human face usually has significant portions of relatively untextured regions.

3 Application: Mapping new faces to 3-D DECface 
3.1 DECface 

DECface is a system that facilitates the development of applications requiring a real-time
lip-synchronized synthetic face [26]. Originally based on the X Window System and the audio
facilities of DECtalk and AF [12], DECface has been built with a simple interface protocol to support the
development of face-related applications. The fundamental components of DECface are software
speech synthesis, AF (AudioFile), and face modeling.

Of particular importance to us is the face modeling component. It involves texture-mapping 
frontal view face images (synthetic or real) onto a correctly-shaped wireframe. 

Topologies for facial synthesis are typically created from explicit 3D polygons [15]. For
simplicity, we construct a simple 2D representation of the full frontal view because, for the most part,
personal interactions occur face-to-face. This model consists of 200 polygons of which 50
represent the mouth and an additional 20 represent the teeth (Figure 3). The jaw nodes are moved
vertically as a function of displacement of the corners of the mouth [3]. The lower teeth are
displaced along with the lower jaw. Eyelids are created from a double set of nodes describing the
upper lid, such that as they move, the lids close.

The canonical representation is originally mapped onto the individual's image mostly by hand.
This requires the careful placement of key nodes at certain locations, as illustrated in Figure 3: in
particular, the corners of the lips and eyes, the placement of the chin and eyebrows, as well as the
overall margins of the face.

To generate facial expressions within DECface, two primary muscle types were implemented:
linear and sheet. When orchestrated together, these muscles can create universally recognized
facial expressions such as anger, fear, surprise, disgust, sadness and happiness. These muscle
types can be described as a geometric deformation function of which the linear muscle has the
simplest derivation (for more details see [25]).

Figure 3: Reconfigured facial geometry on the face image. Notice the close alignment of the nodes
around the eyes, mouth, chin and face margins.

DECface is currently being used as a visual and audio feedback mechanism for the Smart
Kiosk project at Cambridge Research Lab, Digital Equipment Corp. [27]. The Smart Kiosk can
be considered as an enhanced version of the Automatic Teller Machine, with the added capability
of being able to interact with the user through body tracking, and gesture and speech recognition.
DECface is used to personalize the interaction between the Smart Kiosk and the user. This
objective is achieved partly by its ability to communicate its focus of attention to the user population
through the gaze behavior of eye contact.

3.2 Mapping faces using one input image 

As mentioned in the previous section, mapping new faces to DECface involves texture-mapping
frontal view face images (synthetic or real) onto a correctly-shaped wireframe. The original
method to generate DECface with a new face is to manually adjust every node, which is a very
tedious process. A separate "generic" face (whose DECface topology and 3-D distribution are
known) is used as a reference during the process of moving each node within the new face
image. This node-moving process is equivalent to the transfer of z information from the "generic"
face to the new face. We have investigated methods to automate this process by using templates of
facial features such as the eyes, mouth, and face profile.

Figure 4: Initial state. The generic face whose DECface topology is known is shown at the far
left. The other three images are the input images, with the reference image being the second
image from the left.

Because only one input face image is used, the canonical height distribution is preserved when
generating the 3-D version of DECface. This is, however, not always desirable, especially
since many human faces have significantly different facial shapes. As a result, to preserve as much
as possible the correct shape, we use three input images, each showing a different pose of the face,
with one showing the frontal face pose. It is possible, of course, to use two or more than three
images to achieve the same goal.



3.3 Mapping faces using three input images 

In our work, we use three images of the face at different orientations, with one of them at a frontal
pose and used as the reference image. As before, we assume that all camera parameters, intrinsic and
extrinsic, are unknown (except that the aspect ratio is one, the image skew is zero, and the principal
point is at the image center). We also assume that the point correspondences between the generic
face and the reference face have been established as described in the previous section. This is the
same as assuming that the reference shape of the model has been initialized. Note, however, that
the point correspondences across the image sequence are not known.

We set both $\alpha_i$ and $\beta_i$ in (13) to 0.25. As mentioned before, the feature track fine-tuning step
involves using the spline-based tracker on the predicted appearance and actual image. However,
because the prediction does not involve the background, only the predicted face portion of
the image is involved; the weights associated with the background are set to zero in the
spline-based tracker.

Figure 5: State immediately after performing spline-based registration for the second and third
images in the sequence.

An example is shown in Figures 4-8. A comparison between the original 3-D face model and 
the deformed 3-D face model is shown in Figure 9. As can be seen, the resulting 3-D face has been 
horizontally stretched somewhat. If the geometric constraints are not imposed (except for just the 
simple line-of-sight constraint), then the resulting 3-D face model is quite badly deformed, as seen 
from Figure 10. 

The input images of another face are shown in Figure 11. The resulting face model rendered at
three different viewpoints is displayed in Figure 12. As can be seen from the side-by-side visual
comparison of the 3-D face models prior to and after deformation (Figure 13), the 3-D model has
again been stretched horizontally. In addition, the shape of the forehead is made rounder.

4 Discussion 

The algorithm may easily fail if the change in object appearance across the image sequence is too
drastic from one frame to another. In our application of 3-D face modeling, it tolerates face rotation
up to about 15°.

A variant of the method would involve the direct incorporation of the optic flow term into the
objective function (8) to give

$$\mathcal{E}_{all}(\mathbf{a}) = \mathcal{E}_{sfm}(\mathbf{a}) + \mathcal{E}_{geom}(\mathbf{a}) + \mathcal{E}_{flow}(\mathbf{u}), \tag{19}$$

where

$$\mathcal{E}_{flow}(\mathbf{u}) = \sum_i \sum_{j>1} \gamma_{ij} \left| I_1(\mathbf{u}_{i1}) - I_j(\mathbf{u}_{ij}) \right|^2, \tag{20}$$

with $I_j(\mathbf{u}_{ij})$ being the intensity (or color) at $\mathbf{u}_{ij}$ on frame $j$, and $\gamma_{ij}$ the weight associated with
the point $\mathbf{u}_{ij}$. Note that in our particular application of facial model recovery, since the first frame
is the reference frame, $\mathbf{u}_{i1}$ is kept constant throughout.

Figure 8: Appearance of final 3-D face model at various poses.

Figure 9: Side views of original (left) and deformed (right) 3-D meshes for the face in Figure 4.

Figure 10: Appearance of final 3-D face model at various poses (with no geometric constraints,
apart from line-of-sight).

Figure 11: Input images of another face.

Figure 12: Appearance of final 3-D face model at various poses (from input images shown in
Figure 11).

Figure 13: Side views of original (left) and deformed (right) 3-D meshes for the face in Figure 11.

One problem with directly embedding this term in the structure from motion module is that the
flow error term is local and thus unable to account for large motions. It would either require that
the initial model pose be quite close to the true model pose, or the addition of a hierarchical scheme
similar to that implemented in the spline-based registration method. Otherwise, the system is likely
to have better convergence properties if the tracking is performed outside the structure from motion
loop. In the current implementation, while small perturbations of the model pose would
be desirable from the computational point of view (but not from the accuracy point of view), this
is not a requirement.

In addition, using the flow error term directly may not be efficient from the computational
point of view. This is because at every iteration and incremental step, a new predicted appearance
has to be computed. This operation is rather computationally expensive, especially if the size of
the projected model is large. Having the tracking module only loosely coupled with structure from
motion results in fewer iterations of computing the predicted object appearance. Finally,
there is the non-trivial question of assigning the weights $\gamma_{ij}$ relative to the structure from motion
and geometric constraint related weights.

Geometric constraints on the face deformation in other forms can also be used. An example
would be to use the most dominant few deformation vectors based on SVD analysis of multiple
training 3-D faces [9]. A similar approach would be to apply modal analysis on the multiple training
3-D faces [16, 6] to extract common and permissible deformations in terms of nonrigid modes.

5 Summary 

We have described an algorithm called appearance-based constrained structure from motion (AbCSfm) 
that allows 3-D models to be extracted directly from a sequence of uncalibrated images. It is not 
necessary to precompute feature correspondences across the image sequence. The algorithm
dynamically determines the feature correspondences, estimates the structure and camera motion, and
uses them to predict the object appearance in order to refine the feature correspondences.

We have used the algorithm to model 3-D faces from a small number of input images, and
results have shown the algorithm to be robust and to have good convergence properties.